Effects: partial CPS transform #1384
Conversation
It would be nice to be able to propagate information across units during separate compilation. (related to #550)

Would global_flow.ml allow addressing #594?
I have added a test.
@pmwhite Do you want to give it a try?

I didn't read it in detail, but it looks good from here. The performance wiki page still needs to be updated (text and graphs). It would be nice to show timing improvements for non-benchmark programs.
Indeed, I would. Likely some time next week.
Thanks for this work. The improvements are great. IIUC the benchmarks don't use effect handlers. I would also be interested in seeing the improvements in programs that use effect handlers. In the PLDI 21 paper on effect handlers, we studied the performance of effect handlers using two small benchmarks: chameneos redux (Section 6.3.1) and generators (Section 6.3.2). The source code for the benchmarks is here: https://github.com/kayceesrk/code-snippets/tree/master/eff_bench. I would be interested in seeing the performance difference between
This makes a significant difference as well. Here is a quick measurement:
Thanks for the results. It is good to see partial CPS doing well here. The next question is harder to answer, because it may be ill informed, but let me ask it anyway. On these benchmarks, how close to perfectly precise / optimal performance is the current partial CPS? As in, if you had a chance to only CPS those functions which are absolutely needed in these benchmarks, what would the performance be?
I think the code for

```ocaml
let chams = List.map ~f:(fun c -> ref c) colors in
...
let ns = List.map ~f:MVar.take fs in
```
Thanks @vouillon for your answer. It helped me put the numbers in perspective.
Do you expect the analysis to be more expensive when effects are off?
Just tried this patch out. I've run into an issue, I believe with the lexer, which is having trouble parsing column 57 of this line. I assume this has to do with the recent changes to the lexer/parser, and not with this particular PR, but it does block me from testing this PR itself.
Should be fixed by #1395
I've now run into the following error: This refers, I believe, to this library; hopefully that reproduces easily enough.
@vouillon, should we merge?
I still need to update the documentation...
We start from a pretty good ordering (reverse postorder is optimal when there is no loop). Then we use a queue so that we process all other nodes before coming back to a node, resulting in fewer iterations. This is useful when the graph changes dynamically.
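The iteration strategy above can be sketched as a small worklist solver (hypothetical names, not the actual global_flow.ml API): seed a FIFO queue in reverse postorder, then re-enqueue a node's successors only when its value changes, so a node is revisited only after all other pending nodes have been processed.

```ocaml
(* Sketch: fixpoint over an integer-indexed graph. [order] is the
   initial visit order (ideally reverse postorder), [succs] gives the
   successors of a node, [update] recomputes a node's value from the
   current value array. A node is enqueued at most once at a time. *)
let fixpoint ~order ~succs ~update init =
  let value = Array.copy init in
  let in_queue = Array.make (Array.length init) false in
  let q = Queue.create () in
  List.iter (fun i -> Queue.add i q; in_queue.(i) <- true) order;
  while not (Queue.is_empty q) do
    let i = Queue.pop q in
    in_queue.(i) <- false;
    let v = update i value in
    if v <> value.(i) then begin
      value.(i) <- v;
      (* The value changed: the successors must be reconsidered. *)
      List.iter
        (fun j ->
          if not in_queue.(j) then begin
            Queue.add j q;
            in_queue.(j) <- true
          end)
        (succs i)
    end
  done;
  value
```

For instance, reachability from node 0 in a four-node graph with a back edge converges in a single extra pass with this ordering.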
We omit stack checks when jumping from one block to another within a function, except for backward edges. Stack checks are also omitted when calling the function continuations. We have to check the stack depth in `caml_alloc_stack` for the test `evenodd.ml` to succeed. Otherwise, popping all the fibers exhausts the JavaScript stack. We don't have this issue with the OCaml runtime since it allocates one stack per fiber.
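The bounded-stack idea can be illustrated with a small trampoline (a hedged sketch in OCaml with illustrative names; the actual runtime operates directly on the JavaScript stack). Running in direct style most of the time and bouncing back to the trampoline only when a depth counter exceeds a bound is analogous to checking the stack only on backward edges rather than on every block-to-block jump.

```ocaml
(* A computation either finishes or yields a suspended continuation
   that the trampoline resumes from a fresh (reset) stack depth. *)
type 'a step = Done of 'a | More of (unit -> 'a step)

let rec trampoline = function
  | Done v -> v
  | More k -> trampoline (k ())

(* Only check the depth counter occasionally; intermediate steps run
   with no check at all. (In OCaml [go] is tail-recursive anyway;
   this merely models the JavaScript situation.) *)
let bound = 1000

let countdown n =
  let rec go n depth =
    if n = 0 then Done 0
    else if depth >= bound then More (fun () -> go (n - 1) 0)
    else go (n - 1) (depth + 1)
  in
  trampoline (go n 0)
```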
I think the issue only occurs when optimization of tail recursion is enabled
We analyze the call graph to avoid turning functions into CPS when we know that they don't involve effects. This relies on a global control flow analysis to find which functions might be called where.
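The core of this idea can be sketched as a backward propagation over the call graph (hypothetical names; the analysis in this PR is flow-based and considerably more precise): a function must be CPS-transformed if it performs an effect itself, or if it calls, directly or indirectly, a function that must be.

```ocaml
(* Sketch: [performs_effect] says whether a function performs an
   effect directly; [calls] gives its direct callees. We mark
   effectful functions, then propagate the mark to callers until a
   fixpoint is reached. Returns the membership predicate. *)
let needs_cps ~performs_effect ~calls functions =
  (* Reverse edges: for each callee, record its callers. *)
  let callers = Hashtbl.create 16 in
  List.iter
    (fun f -> List.iter (fun g -> Hashtbl.add callers g f) (calls f))
    functions;
  let cps = Hashtbl.create 16 in
  let q = Queue.create () in
  let mark f =
    if not (Hashtbl.mem cps f) then begin
      Hashtbl.replace cps f ();
      Queue.add f q
    end
  in
  List.iter (fun f -> if performs_effect f then mark f) functions;
  while not (Queue.is_empty q) do
    let f = Queue.pop q in
    List.iter mark (Hashtbl.find_all callers f)
  done;
  fun f -> Hashtbl.mem cps f
```

A function never reaching an effectful call stays in direct style, which is exactly what makes the transform partial.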
Hi, I'm surprised to see in this benchmark that partial CPS is in some cases much slower? (The median is 0.65 faster than full CPS, but in some cases it is > 4 times slower, meaning 25/30 times slower than no CPS.) Here are the tests for which partial is slower:
The slower tests seem to be at the end of the table. If the order corresponds to the order of execution, maybe something happened to the machine while running the tests... Another explanation could be that some control flow is exception-based and the full CPS version would
That was my hypothesis as well. We should try to reproduce this at some point. Unfortunately, compiling these benchmarks is a bit complicated when you are not at Jane Street...

We identify functions that don't involve effects by analyzing the call graph, and we keep them in direct style. This relies on a global control flow analysis to find which functions might be called where.
The analysis is very effective on small / monomorphic programs.
`hamming` is somewhat slower since it uses lazy values (we don't analyze mutable values). `nucleic` is faster since the global control flow analysis is used to avoid some slow function calls. This measurement was made before Apply functions: optimizations #1358 was merged. The gap is probably narrower now.

The analysis is less effective on large programs. Higher-order functions such as `List.iter` are turned into CPS, and then all functions that call, directly or indirectly, such a function need to be turned into CPS as well. There is also some horizontal contamination, where a function needs to be turned into CPS since it is used in a context which expects a CPS function, and this then impacts all other places where it is called. Still, `ocamlc` is now only about ~10% slower (it is about 60% slower with the released version of Js_of_ocaml). CAMLboy is less than 25% slower (650 FPS instead of 800 FPS).
The size of the generated code is less than 20% larger, a few percent larger when compressed. For a large Web app, I see a 44% increase in generated code size (6% when compressed).


Compiling `ocamlc` is about 25% slower.